NSF PAR Search | NSF Public Access Repository

Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots

Chi, Cheng; Xu, Zhenjia; Pan, Chuer; Cousineau, Eric; Burchfiel, Benjamin; Feng, Siyuan; Tedrake, Russ; Song, Shuran (March 2024, Robotics: Science and Systems)

Full Text Available
Self-supervised Semantic-driven Phoneme Discovery for Zero-resource Speech Recognition

https://doi.org/10.18653/v1/2022.acl-long.553

Wang, Liming; Feng, Siyuan; Hasegawa-Johnson, Mark; Yoo, Chang (January 2022, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers))
Muresan, Smaranda; Nakov, Preslav; Villavicencio, Aline (Ed.)
Phonemes are defined by their relationship to words: changing a phoneme changes the word. Learning a phoneme inventory with little supervision has been a longstanding challenge with important applications to under-resourced speech technology. In this paper, we bridge the gap between the linguistic and statistical definition of phonemes and propose a novel neural discrete representation learning model for self-supervised learning of phoneme inventory with raw speech and word labels. Under mild assumptions, we prove that the phoneme inventory learned by our approach converges to the true one with an exponentially low error rate. Moreover, in experiments on TIMIT and Mboshi benchmarks, our approach consistently learns a better phoneme-level representation and achieves a lower error rate in a zero-resource phoneme recognition task than previous state-of-the-art self-supervised representation learning algorithms.
Full Text Available
Discovering phonetic inventories with crosslingual automatic speech recognition

https://doi.org/10.1016/j.csl.2022.101358

Żelasko, Piotr; Feng, Siyuan; Moro Velázquez, Laureano; Abavisani, Ali; Bhati, Saurabhchand; Scharenborg, Odette; Hasegawa-Johnson, Mark; Dehak, Najim (July 2022, Computer Speech & Language)

Full Text Available
How Phonotactics Affect Multilingual and Zero-Shot ASR Performance

https://doi.org/10.1109/ICASSP39728.2021.9414478

Feng, Siyuan; Zelasko, Piotr; Moro-Velazquez, Laureano; Abavisani, Ali; Hasegawa-Johnson, Mark; Scharenborg, Odette; Dehak, Najim (June 2021, ICASSP)
The idea of combining multiple languages’ recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successful in zero-shot transfer to unseen languages. Because that model lacks an explicit factorization of the acoustic model (AM) and language model (LM), it is unclear to what degree the performance suffered from differences in pronunciation or the mismatch in phonotactics. To gain more insight into the factors limiting zero-shot ASR transfer, we replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM. Then, we perform an extensive evaluation of monolingual, multilingual, and crosslingual (zero-shot) acoustic and language models on a set of 13 phonetically diverse languages. We show that the gain from modeling crosslingual phonotactics is limited, and imposing a too strong model can hurt the zero-shot transfer. Furthermore, we find that a multilingual LM hurts a multilingual ASR system’s performance, and retaining only the target language’s phonotactic data in LM training is preferable.
Full Text Available
